17 research outputs found

    Automated Code Generation for Lattice Quantum Chromodynamics and beyond

    We present here our ongoing work on a Domain Specific Language that aims to simplify Monte-Carlo simulations and measurements in the domain of Lattice Quantum Chromodynamics. The tool-chain, called Qiral, is used to produce high-performance OpenMP C code from LaTeX sources. We discuss conceptual issues and details of implementation and optimization, and compare the performance of the generated code with that of well-established simulation software.
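    As a rough illustration only (not actual Qiral input or output, and with illustrative names such as VOLUME and site_vec), the kind of OpenMP C kernel such a tool-chain targets is a lattice-wide linear-algebra expression whose site loop is parallelized:

    /* Hypothetical sketch of generated OpenMP C: y = a*x + y over all lattice
     * sites, with 12 complex components (3 colours x 4 spins) per site.
     * Names are illustrative, not Qiral's. */
    #include <complex.h>
    #include <omp.h>

    #define VOLUME (16*16*16*32)          /* example lattice size */
    typedef double complex site_vec[12];

    void axpy(double complex a, const site_vec *x, site_vec *y)
    {
        #pragma omp parallel for
        for (long s = 0; s < VOLUME; ++s)
            for (int c = 0; c < 12; ++c)
                y[s][c] += a * x[s][c];
    }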

    Automated Code Generation for Lattice QCD Simulation

    Quantum Chromodynamics (QCD) is the theory of the strong nuclear force, responsible for the interactions between sub-nuclear particles. QCD simulations are typically performed through the lattice gauge theory approach, which provides a discrete analytical formalism called LQCD (Lattice Quantum Chromodynamics). LQCD simulations usually involve generating and then processing data at the petabyte scale, which demands multiple teraflop-years on supercomputers. Large parts of both generation and analysis can be reduced to the inversion of an extremely large matrix, the so-called Wilson-Dirac operator. Because this matrix is always sparse and structured, iterative methods are the natural choice for this inversion. The application of the operator, which amounts to a matrix-vector product, therefore appears as a critical computation kernel that should be optimized as much as possible. Evaluating the Wilson-Dirac operator involves symmetric stencil calculations in which each node has 8 neighbors. Such a configuration is a serious obstacle when it comes to memory accesses and data exchanges among processors. For current and future generations of supercomputers, the hierarchical memory structure makes it next to impossible for a physicist to write efficient code by hand. Addressing these issues in order to devote an acceptable share of the computing cycles to the real need, that is, to reach a good level of efficiency, is the main concern of this paper. We present here a Domain Specific Language and corresponding toolkit, called QIRAL, which provides a complete solution from symbolic notation to simulation code.
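    To make the structure of that critical kernel concrete, the following is a minimal sketch (our illustration, not QIRAL output) of an 8-neighbour stencil sweep on a 4D lattice: each site combines contributions from its forward and backward neighbours in the four directions, which is exactly the access pattern that dominates memory traffic. The full Wilson-Dirac operator additionally carries 3x4 colour-spin components per site, SU(3) link matrices and gamma-matrix projectors on each hop, all omitted here.

    /* Illustrative 8-neighbour stencil on a 4D lattice (one scalar per site). */
    #include <omp.h>

    #define LT 32
    #define LS 16

    static inline long idx(int t, int x, int y, int z)
    {
        return ((long)t*LS*LS*LS) + ((long)x*LS*LS) + ((long)y*LS) + z;
    }

    void hopping(const double *in, double *out, double kappa)
    {
        #pragma omp parallel for collapse(2)
        for (int t = 0; t < LT; ++t)
            for (int x = 0; x < LS; ++x)
                for (int y = 0; y < LS; ++y)
                    for (int z = 0; z < LS; ++z) {
                        double h =
                            in[idx((t+1)%LT, x, y, z)] + in[idx((t+LT-1)%LT, x, y, z)]
                          + in[idx(t, (x+1)%LS, y, z)] + in[idx(t, (x+LS-1)%LS, y, z)]
                          + in[idx(t, x, (y+1)%LS, z)] + in[idx(t, x, (y+LS-1)%LS, z)]
                          + in[idx(t, x, y, (z+1)%LS)] + in[idx(t, x, y, (z+LS-1)%LS)];
                        out[idx(t, x, y, z)] = in[idx(t, x, y, z)] - kappa * h;
                    }
    }

    An iterative inversion then only needs repeated applications of this kernel together with vector updates and dot products, which is why the memory behaviour of the stencil dominates the overall simulation cost.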

    CASH: Revisiting hardware sharing in single-chip parallel processor

    As increasing the issue width yields diminishing returns on superscalar processors, thread parallelism on a single chip is becoming a reality. In the past few years, both the SMT (Simultaneous MultiThreading) and CMP (Chip MultiProcessor) approaches were first investigated by academics and are now implemented by industry. In some sense, CMP and SMT represent two extreme design points.

    CASH Design Space Exploration

    As increasing the issue width yields diminishing returns on superscalar processors, thread parallelism on a single chip is becoming a reality. In the past few years, both the SMT and CMP approaches were first investigated by academics and are now implemented by industry. In some sense, SMT and CMP represent two extreme design points. The CASH parallel processor (for CMP And SMT Hybrid) is a possible intermediate design point for on-chip thread parallelism in terms of design complexity and hardware sharing. It retains resource sharing, as in SMT, when such sharing can be made non-critical for the implementation, but splits resources, as in CMP, wherever sharing leads to a superlinear increase in hardware complexity. This paper explores the multi-dimensional design space of the CASH architecture. It compares the performance of a single thread running on CASH, SMT and CMP processors, then investigates the performance of multi-program and parallel workloads on these processors, and finally explores how CASH performance varies with cache size and cache associativity. The experimental results show that the CASH processor has great potential to improve the performance of single-thread workloads and most multi-program workloads, while maintaining a lower implementation complexity than SMT and CMP.

    An Hybrid Data Transfer Optimization Technique for GPGPU

    Graphics Processing Units (GPU) can provide tremendous computing power; current NVidia and ATI hardware displays a peak performance of hundreds of gigaflops. However, because the data transfer speed between CPU and GPU is limited, these devices are difficult to use for accelerating numerical applications. In this paper we propose a hybrid software technique for automatically optimizing data transfers, based on static and dynamic information about data accesses.
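    As a rough illustration of the general idea (our sketch under assumed names such as mirror_t, not the paper's actual technique or runtime), a host-side runtime can keep a per-buffer validity flag and skip host-to-device copies whenever static analysis or run-time tracking shows the device copy is still up to date:

    /* Illustrative sketch: skip redundant host-to-device copies by tracking
     * whether the device mirror of a buffer is still valid.  Uses the CUDA
     * runtime C API; the flag updates stand in for the static/dynamic
     * information on data accesses described in the abstract. */
    #include <cuda_runtime.h>
    #include <stddef.h>

    typedef struct {
        void  *host;
        void  *dev;
        size_t bytes;
        int    dev_valid;   /* 1 if the device copy matches the host copy */
    } mirror_t;

    /* Called before a kernel launch that reads the buffer on the GPU. */
    void to_device_if_needed(mirror_t *m)
    {
        if (!m->dev) cudaMalloc(&m->dev, m->bytes);
        if (!m->dev_valid) {
            cudaMemcpy(m->dev, m->host, m->bytes, cudaMemcpyHostToDevice);
            m->dev_valid = 1;
        }
    }

    /* Called whenever the CPU writes the buffer (known from static analysis
     * or detected at run time), invalidating the device mirror. */
    void host_wrote(mirror_t *m) { m->dev_valid = 0; }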